Abstract: We consider the problem of detecting a small subset 
of defective items from a large set via non-adaptive "random 
pooling" group tests. We consider both the case when the mea- 
surements are noiseless, and the case when the measurements 
are noisy (the outcome of each group test may be independently 
faulty with probability q). Order-optimal results for these sce- 
narios are known in the literature. We give information-theoretic 
lower bounds on the query complexity of these problems, and 
provide corresponding computationally efficient algorithms that 
match the lower bounds up to a constant factor. To the best of 
our knowledge this work is the first to explicitly estimate such a 
constant that characterizes the gap between the upper and lower 
bounds for these problems. 

I. Introduction 

The goal of group testing is to identify a small unknown
subset $\mathcal{D}$ of defective items embedded in a much larger set
$\mathcal{N}$ (usually in the setting where $|\mathcal{D}|$ is much smaller than
$|\mathcal{N}|$, i.e., $|\mathcal{D}|$ is $o(|\mathcal{N}|)$). This problem was first considered by
Dorfman [1] in scenarios where multiple items in a group can
be simultaneously tested, with a binary output depending on
whether or not a "defective" item is present in the group
being tested. In general, the goal of group-testing algorithms
is to identify the defective set with as few measurements
as possible. As demonstrated in [1] and subsequent work, with
judicious grouping and testing, far fewer tests than the trivial upper
bound of $|\mathcal{N}|$ may be required to identify the set of defective
items.

In this work our model has four important assumptions. 

• Non-adaptive group testing: The set of items being 
tested in each test is required to be independent of the 
outcome of every other test. This restriction is often 
useful in practice, since this enables parallelization of 
the testing process. It also allows for an automated 
testing process (whereas the procedures and especially 
the hardware required for adaptive group testing may be 
significantly more complex). Furthermore, it is known
(see, for instance, [2]) that adaptive group-testing
algorithms do not improve upon non-adaptive group-testing
algorithms by more than a constant factor in the number
of tests required to identify the set of defective items.

• "Small-error" group testing: Our algorithms are allowed
to have a "small" probability of error. It is known (for
instance [3]) that zero-error algorithms require significantly
more tests asymptotically (in the number of defective
items) than algorithms that allow asymptotically small
errors.

• "Noisy" measurements: In addition to the noiseless
group-testing problem specified by the above, we also
consider the "noisy" variant of the problem, wherein
the result of each test may differ from the true result
(in an independent and identically distributed manner)
with a certain pre-specified probability $q$.¹ Since the
measurements are noisy, the problem of estimating the
set of defective items is more challenging, and is known
to require more tests.²

• Computationally efficient and near-optimal algorithms:
Most algorithms in the literature focus on optimizing the
number of measurements required - in some cases, this
leads to algorithms that may not be computationally
efficient to implement (e.g., [3]). In our work we require
that our algorithms be both computationally efficient and
near-optimal in the number of measurements required.

In the literature, order-optimal upper and lower bounds on
the number of tests required are known for the problems we
consider (for instance [3], [7]). In both the noiseless and noisy
variants, the number of measurements required to identify the
set of defective items is known to be $T = \Theta(d \log n)$ - here
$n = |\mathcal{N}|$ is the total number of items and $d = |\mathcal{D}|$ is the size of
the defective subset. However, in the noisy variant, the number
of tests required is in general a constant factor larger than in
the noiseless case (where this constant $\beta$ is independent of
both $n$ and $d$, but may depend on the noise parameter $q$ and


1 An alternative model involving "worst-case" errors has also been
considered in the literature (for instance [4]), wherein the designed
group-testing algorithm is required to be resilient to all noise patterns wherein at
most a fraction $q$ of the results differ from their true values, rather than
the probabilistic guarantee we give against most fraction-$q$ errors. This is
analogous to the difference between combinatorial coding-theoretic
error-correcting codes (for instance Gilbert-Varshamov codes [5]) and probabilistic
information-theoretic codes (for instance [6]). In this work we concern
ourselves only with the latter, though it is possible that our techniques can
also be used to analyze the former.

2 We wish to highlight the difference between noise and errors. We 
use the former term to refer to noise in the outcomes of the group-test, 
regardless of the group-testing algorithm used. The latter term is used 
to refer to the error due to the estimation process of the group-testing 
algorithm. 



the allowable error-probability $\delta$ of the algorithm).

However, to the best of our knowledge, prior to this work
no explicit characterization has been given of the actual
number of measurements required (rather than just
order-optimal results). In particular, we analyze two algorithms that
we call Combinatorial Basis Pursuit (CBP) and Combinatorial
Orthogonal Matching Pursuit (COMP),³ both of which have been
previously considered in the group-testing literature (under
different names) for both noiseless and noisy scenarios (see,
for instance, [8]). We provide explicit upper bounds on $\beta(q, \delta)$
for both these algorithms. Further, we also provide
corresponding lower bounds on $\beta(q, \delta)$ for any group-testing algorithm.
These upper and lower bounds are asymptotically independent
of both $n$ and $d$. The lower bounds are information-theoretic,
and the upper bounds are derived from a detailed analysis of
CBP and COMP under both the noiseless and noisy scenarios.
In general, the bounds resulting from the analysis of the
algorithms match our simulations to a high degree, which
indicates that the bounds we derive are not too slack.

This paper is organized as follows. In Section II we
introduce the model and corresponding notation, and describe the
algorithms analyzed in this work. In Section III we describe
the main results of this work. Sections IV and V contain,
respectively, the proofs of our information-theoretic lower bounds
and the analysis of the group-testing algorithms considered. Our simulation
results are presented in Section VI.

A. Prior Work 

Dorfman [1] first considered the group-testing problem
during World War II with regard to testing soldiers for syphilis.
Since then, a large body of literature has considered the 
problem (see for instance the book by 0). 

In this work we focus on non-adaptive algorithms. Here we 
can further subdivide algorithms according to whether errors 
(that decay asymptotically to zero with large n) are allowed or 
not in the reconstruction algorithm, and, orthogonally, whether 
the measurements are noisy or not.

If errors are not allowed in the group-testing algorithm,
it is known that at least $\Omega(d^2 \log n)$ tests are required in
both noiseless and noisy scenarios, which may be
considerably larger than the $O(d \log n)$ bounds that are known
(for instance [3] and [7]) for the "small-error" scenario.
Further, in the noisy scenario, if no errors are allowed in the
reconstruction algorithm, only noise patterns with an absolute
bound on the total number of noisy measurements can be
handled.⁴ For these reasons, we choose to focus on algorithms
in which a small probability of error is allowed - the reader
3 This choice of nomenclature is motivated by two popular Compressive
Sensing decoding algorithms, respectively Basis Pursuit and Orthogonal
Matching Pursuit - as we note in Section I-A, the decoding algorithms we
analyze in this work might be viewed as combinatorial analogues of those
well-analyzed algorithms.

4 This is because in the "usual" noise model, wherein each measurement 
may be noisy with a certain probability, there is a non-zero probability that 
an arbitrary fraction of the measurements are corrupted in an arbitrarily bad 
manner. In this case no group-testing algorithm can hope to decode with 
zero-error. 



interested in zero-error algorithms is encouraged to look at 

0, 0, Go), OH. HH- 

The works closest to ours are those of [3] and [7]. The
former analyzes the performance of certain group-testing
algorithms in both noiseless and noisy settings
information-theoretically, and proves order-optimality. However, only
order-optimal (rather than explicit) bounds on the number of
tests required are provided, and the algorithms analyzed are not
computationally efficient. The work of [7] proposes a
belief-propagation decoding rule to improve the computational
efficiency of the algorithms of [3], but no proof of correctness is
provided. In contrast, in this work we provide the first explicit
bounds on computationally efficient group-testing algorithms.

Information-theoretic lower bounds on the number of tests 
required are folklore - some instances of these bounds for 
some models are provided in ifTUll . Since we were unable to 
find a specific reference covering all cases for our model, we
also prove our lower bounds in Section IV.



There are intriguing connections between the two
algorithms we consider and corresponding Compressive Sensing
(CS) algorithms. In particular, Basis Pursuit has been
well-analyzed in the CS literature (for instance [13], [14]), as
has Orthogonal Matching Pursuit (for instance [15]). The
primary difference between those algorithms and the ones
considered here is that in CS all measurements are over the
real field $\mathbb{R}$, whereas in group testing the measurements are
modeled instead as an OR of AND clauses (hence the term
"Combinatorial").

II. Background 
A. Model and Notation 
Fig. 1. An example demonstrating a typical non-adaptive group-testing setup.
The $T \times n$ binary group-testing matrix represents the items being tested in
each test, the length-$n$ binary input vector $x$ is a weight-$d$ vector encoding
the locations of the $d$ defective items in $\mathcal{D}$, the length-$T$ binary noiseless
result vector $y$ denotes the outcomes of the group tests in the absence of
noise, and the length-$T$ binary noisy result vector $\hat{y}$ denotes the actually observed
noisy outcomes of the group tests, as the result of the noiseless result vector
being perturbed by the length-$T$ binary noise vector $v$. The length-$n$ binary
estimate vector $\hat{x}$ represents the estimated locations of the defective items.



A set $\mathcal{N}$ contains $n$ items, of which an unknown subset
$\mathcal{D}$ are said to be "defective".⁵ The goal of group testing is
to correctly identify the set of defective items via a minimal
number of "group tests", as defined below (see Figure 1 for a
graphical representation).

Each row of a $T \times n$ binary group-testing matrix $M$
corresponds to a distinct test, and each column corresponds
to a distinct item. Hence the items that comprise the group
being tested in the $i$th test are exactly those corresponding to
columns containing a 1 in the $i$th location. The method of
generating such a matrix $M$ is part of the design of the group
test - this and the other part, that of estimating the set $\mathcal{D}$, are
described in Section II-B.

The length-$n$ binary input vector $x$ represents the set $\mathcal{N}$,
and contains 1s exactly in the locations corresponding to the
items of $\mathcal{D}$. The locations with ones (the defective items) are said
to be positive - the other locations are said to be negative. We
use these terms interchangeably.

The outcomes of the noiseless tests correspond to the
length-$T$ binary noiseless result vector $y$, with a 1 in the $i$th location
if and only if the $i$th test contains at least one defective item.

The observed vector of test outcomes in the noisy scenario
is denoted by the length-$T$ binary noisy result vector $\hat{y}$ - the
probability that each entry of $\hat{y}$ differs from the corresponding
entry in $y$ is $q$, where $q$ is the noise parameter. The locations
where the noiseless and the noisy result vectors differ are
denoted by the length-$T$ binary noise vector $v$, with 1s in
the locations where they differ.

The estimate of the locations of the defective items is
encoded in the length-$n$ binary estimate vector $\hat{x}$, with 1s in
the locations where the group-testing algorithms described in
Section II-B estimate the defective items to be.

The probability of error of any group-testing algorithm is
defined as the probability (over the input vector $x$,
group-testing matrix $M$, and noise vector $v$) that the estimate vector $\hat{x}$
differs from the input vector $x$.
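
The measurement model above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `run_tests`, the list-of-lists matrix layout, and the fixed seed are assumptions made here, not part of the paper.

```python
import random

def run_tests(M, x, q=0.0, rng=None):
    """Simulate group tests (hypothetical helper, not from the paper).

    M: T x n binary matrix as a list of rows, x: length-n binary input vector,
    q: probability that each test outcome is flipped.
    Returns (y, y_hat, v): noiseless results, noisy results, noise vector.
    """
    rng = rng or random.Random(0)  # assumed fixed seed for reproducibility
    # Noiseless outcome of test i: OR over the items included in row i.
    y = [int(any(m and xj for m, xj in zip(row, x))) for row in M]
    # Each outcome is flipped independently with probability q.
    v = [int(rng.random() < q) for _ in M]
    y_hat = [yi ^ vi for yi, vi in zip(y, v)]
    return y, y_hat, v

M = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
x = [0, 1, 0, 0]          # item 1 (0-indexed) is the only defective item
y, y_hat, v = run_tests(M, x, q=0.0)
```

With `q=0.0` the noisy and noiseless result vectors coincide; any `q` in (0, 1/2) corresponds to the noisy setting considered in this work.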

B. Algorithms 

We now describe the CBP and COMP algorithms in both 
the noiseless and noisy settings. The algorithms are specified 
by the choices of encoding matrices and decoding algorithms. 
1. NOISELESS ALGORITHMS 

Combinatorial Basis Pursuit (CBP): 

The $T \times n$ group-testing matrix $M$ is defined as follows. A
group sampling parameter $g$ is chosen (the exact values of $T$
and $g$ are code-design parameters to be specified later). Then,
the $i$th row of $M$ is specified by sampling with replacement
from the set $\{1, \ldots, n\}$ exactly $g$ times, and setting the $(i, j)$
location to be one if $j$ is sampled at least once during this

5 In this work, as is common (see for example [16]), we assume that the
number $d$ of defective items in $\mathcal{D}$, or at least a good upper bound on it, is
known a priori. If not, other work (for example [17]) considers non-adaptive
algorithms with low query complexity that help estimate $d$.



process, and zero otherwise.⁶

The decoding algorithm proceeds by using only the tests
which have a negative (zero) outcome to identify all the
non-defective items, and declaring all other items to be defective.
If $M$ is chosen to have enough rows (tests), each non-defective
item should, with significant probability, appear in at least one
negative test, and hence will be appropriately accounted for.
Errors (false positives) occur when at least one non-defective
item is not tested, or only occurs in positive tests (i.e., every
test it occurs in has at least one defective item). The analysis
of this type of algorithm consists of estimating the trade-off
between the number of tests and the probability of error.

More formally, for all tests $i$ whose measurement outcome
$y_i$ is a zero, let $m_i$ denote the corresponding $i$th row of $M$,
and let $m(y)$ denote the length-$n$ binary vector which has 1s in
exactly those locations where there is a 1 in at least one such $m_i$.
The decoder sets $\hat{x}$ as $1 - m(y)$, i.e., it has zeroes where
$m(y)$ has ones, and vice versa.
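
A minimal sketch of the CBP encoder and decoder described above, written in Python for illustration. The names `cbp_matrix`, `outcomes` and `cbp_decode` are hypothetical, and the small matrix in the usage lines is a made-up example rather than one from the paper.

```python
import random

def cbp_matrix(T, n, g, rng):
    """Each row: sample g item indices with replacement, mark sampled items."""
    M = []
    for _ in range(T):
        row = [0] * n
        for _ in range(g):
            row[rng.randrange(n)] = 1
        M.append(row)
    return M

def outcomes(M, x):
    """Noiseless results: test i is positive iff it contains a defective item."""
    return [int(any(m and xj for m, xj in zip(row, x))) for row in M]

def cbp_decode(M, y):
    """Items appearing in any negative test are non-defective; the rest are
    declared defective, i.e. x_hat = 1 - m(y)."""
    n = len(M[0])
    covered = [0] * n                      # this is m(y)
    for row, yi in zip(M, y):
        if yi == 0:
            for j, m in enumerate(row):
                if m:
                    covered[j] = 1
    return [1 - c for c in covered]

M = [[1, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 1, 0, 1]]
x = [0, 0, 0, 1]
x_hat = cbp_decode(M, outcomes(M, x))
```

In the noiseless setting a defective item can never appear in a negative test, so this decoder produces only false positives, matching the discussion above.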

The rough correspondence between this algorithm and Basis
Pursuit ([13], [14]) arises from the fact that, as in Basis Pursuit,
the decoder attempts to find a "sparse" solution $\hat{x}$ that can
generate the observed vector $y$.

Fig. 2. An example demonstrating the CBP algorithm. Based only on the
outcome of the negative tests (those with output zero), the decoder estimates
the set of non-defective items, and "guesses" that the remaining items are
defective.

Combinatorial Orthogonal Matching Pursuit (COMP): 

The T x n group-testing matrix M is defined as follows. A 
group selection parameter p is chosen (the exact values of T 
and p are code-design parameters to be specified later). Then, 
i.i.d. for each (i, j), the (i, j)th element of M is set to be one 
with probability p, and zero otherwise. 

The decoding algorithm works column-wise, instead of row-wise
as in CBP. It attempts to match the columns of $M$ with the
result vector $y$. That is, if a particular column $j$ of $M$ has the
property that all locations $i$ where it has ones also correspond
to ones in the result vector $y$, then the $j$th item ($x_j$) is
declared to be defective (positive). All other items are declared
to be non-defective (negative).
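
The COMP design and decoding rule can be sketched as follows. This is an illustrative Python sketch under our own naming (`comp_matrix`, `comp_decode`); the two-item example is made up to show a "hidden" column.

```python
import random

def comp_matrix(T, n, p, rng):
    """Each entry of M is 1 independently with probability p."""
    return [[int(rng.random() < p) for _ in range(n)] for _ in range(T)]

def comp_decode(M, y):
    """Declare item j defective iff every test containing j is positive.

    Note: a column with no ones satisfies the rule vacuously and is
    therefore declared defective by this literal reading of the rule.
    """
    T, n = len(M), len(M[0])
    return [int(all(y[i] == 1 for i in range(T) if M[i][j] == 1))
            for j in range(n)]

# Item 1 is non-defective, but its only test also contains defective item 0,
# so its column is "hidden" and COMP reports a false positive.
M = [[1, 1],
     [1, 0]]
y = [1, 1]                    # noiseless results for x = [1, 0]
x_hat = comp_decode(M, y)
```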

6 Note that this process of sampling each item in each test with replacement
results in a slightly different distribution than if the group-size of each test was 
fixed a priori and hence the sampling was "without replacement" in each test. 
(For instance, in the process we define, each test may, with some probability, 
test fewer than g items.) The "without replacement" process is a perhaps 
more natural way of defining tests, and also experimentally seems to result in 
slightly better performing algorithms. However, the corresponding analysis is 
significantly trickier, and we have been unable to find closed form expressions 
for such "without replacement" sampling. The primary advantage of analyzing 
the "with replacement" sampling is that in the resulting group-testing matrix 
every entry is then chosen i.i.d.



Note that this decoding algorithm never has false negatives,
only false positives. A false positive occurs when all locations
with ones in the $j$th column of $M$ (corresponding to a
non-defective item $j$) are "hidden" by the ones of other columns
corresponding to defective items. That is, let column $j$ and
some other columns $j_1, \ldots, j_k$ of the matrix $M$ be such that
for each $i$ such that $m_{i,j} = 1$, there exists an index $j'$ in
$\{j_1, \ldots, j_k\}$ for which $m_{i,j'}$ also equals 1. Then if each of
the items $j_1, \ldots, j_k$ is defective, the $j$th item will
also always be declared as defective by the COMP decoder,
regardless of whether or not it actually is. The probability of
this event happening becomes smaller as the number of tests
$T$ becomes larger. Hence, as in CBP, the analysis of this type
of algorithm consists of estimating the trade-off between the
number of tests and the probability of error.

The rough correspondence between this algorithm and
Orthogonal Matching Pursuit ([15]) arises from the fact that, as
in Orthogonal Matching Pursuit, the decoder attempts to match
the columns of the group-testing matrix with the result vector.

Fig. 3. An example demonstrating the COMP algorithm. The algorithm
matches columns of $M$ to the result vector. As in (b) in the figure, since the
result vector "contains" the 7th column, the decoder declares that item to
be defective. Conversely, as in (c), since there is no such containment of the
last column, the decoder declares that item to be non-defective. However,
sometimes, as in (a), an item that is truly negative is "hidden" by some other
columns corresponding to defective items, leading to a false positive.

2. NOISY ALGORITHMS 

Noisy Combinatorial Basis Pursuit (NCBP): 

Let $K$ be a design parameter to be specified later. To generate
the matrix $M$ for the NCBP algorithm, each row of $M$ from the
noiseless CBP algorithm is repeated $K$ times. The decoder
declares the result of each set of $K$ successive
tests to be positive if at least $K/2$ of the tests in that set
are positive, and otherwise declares it to be
negative. The decoder then uses the noiseless CBP algorithm
to estimate $\mathcal{D}$.
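
The repetition step of NCBP amounts to a majority vote over each block of $K$ repeated tests. A minimal Python sketch follows; the function name `majority_collapse` is an assumption of this sketch, and its output would then be passed to a noiseless CBP decoder.

```python
def majority_collapse(y_hat, K):
    """Collapse a noisy result vector of length T*K into T majority votes.

    Block i covers the K successive repetitions of the i-th CBP test and is
    declared positive if at least K/2 of its outcomes are positive.
    """
    if len(y_hat) % K != 0:
        raise ValueError("length of y_hat must be a multiple of K")
    return [int(2 * sum(y_hat[i:i + K]) >= K)
            for i in range(0, len(y_hat), K)]

# Two underlying tests, each repeated K = 3 times.
votes = majority_collapse([1, 1, 0, 0, 0, 1], K=3)
```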

Noisy Combinatorial Orthogonal Matching Pursuit
(NCOMP):

Finally, we consider the algorithm whose analysis is the
major result of this work. In the noisy COMP case, we
relax the sharp-threshold requirement in the original COMP
algorithm that the set of locations of ones in any column of
$M$ corresponding to a positive item be entirely contained in
the set of locations of ones in the result vector. Instead, we
allow for a certain number of "mismatches" - this number
of mismatches depends on both the number of ones in each
column, and also the noise parameter $q$.

Let $p$ and $\Delta$ be design parameters to be specified later. To
generate the matrix $M$ for the NCOMP algorithm, each element
of $M$ is selected i.i.d. with probability $p$ to be 1.

The decoder proceeds as follows. For each column $i$, we
define the indicator set $\mathcal{T}_i$ as the set of indices $j$ in that column
where $m_{j,i} = 1$. We also define the matching set $\mathcal{S}_i$ as the set
of indices $j$ where both $\hat{y}_j = 1$ (corresponding to the noisy
result vector) and $m_{j,i} = 1$.

Then the decoder uses the following "relaxed" thresholding
rule. If $|\mathcal{S}_i| \geq |\mathcal{T}_i|(1 - q(1 + \Delta))$, then the decoder declares the
$i$th item to be defective, else it declares it to be non-defective.
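
The relaxed thresholding rule can be sketched in Python as below. The function name `ncomp_decode` and the parameter name `slack` (standing for the design parameter $\Delta$) are assumptions of this sketch, and the non-strict comparison reflects our reading of the rule.

```python
def ncomp_decode(M, y_hat, q, slack):
    """Declare item i defective iff |S_i| >= |T_i| * (1 - q * (1 + slack)).

    T_i: tests (rows) that contain item i.
    S_i: those tests in T_i whose noisy outcome is positive.
    """
    T, n = len(M), len(M[0])
    x_hat = []
    for i in range(n):
        indicator = [j for j in range(T) if M[j][i] == 1]    # T_i
        matching = [j for j in indicator if y_hat[j] == 1]   # S_i
        threshold = len(indicator) * (1 - q * (1 + slack))
        x_hat.append(int(len(matching) >= threshold))
    return x_hat

M = [[1, 1],
     [1, 1],
     [1, 0],
     [1, 0]]
x_hat = ncomp_decode(M, [1, 1, 1, 0], q=0.2, slack=0.5)
```

Setting `q=0` makes the threshold equal to $|\mathcal{T}_i|$, so the rule reduces to the noiseless COMP rule.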
Fig. 4. An example demonstrating the NCOMP algorithm. The algorithm
matches columns of $M$ to the result vector up to a certain number of
mismatches governed by a threshold. In this example, the threshold is set so
that the number of mismatches must be less than the number of matches. For instance,
in (b) above, the 1s in the third column of the matrix match the 1s in the
result vector in two locations (the 5th and 7th rows), but do not match only
in one location in the 4th row (locations wherein there are 0s in the matrix
columns but 1s in the result vector do not count as mismatches). Hence the
decoder declares that item to be defective, which is the correct decision.
However, consider the false negative generated for the item in (c). This
corresponds to the 7th item. The noise in the 2nd, 3rd and 4th rows of $v$
means that there is only one match (in the 7th row) and two mismatches (2nd
and 4th rows) - hence the decoder declares that item to be non-defective.
Also, sometimes, as in (a), an item that is truly negative has a sufficient
number of measurement errors that the number of mismatches is reduced to
be below the threshold, leading to a false positive.



III. Main Results 



A. Lower Bounds 



We first provide information-theoretic lower bounds on the
number of tests required by any group-testing algorithm. While
we believe these bounds to be "common knowledge" in the
field, we have been unable to pinpoint a reference that gives
an explicit lower bound on the number of tests in terms of the
acceptable probability of error of the group-testing algorithm.
For the sake of completeness, and so that we can benchmark our
analysis of the algorithms we present later, we state and prove
the lower bounds here. All logarithms written as log in this work
are binary; ln denotes the natural logarithm.

Theorem 1: [Folklore] Any group-testing algorithm with
noiseless measurements that has a probability of error of at
most $\epsilon$ requires at least $(1 - \epsilon)\, d \log(n/d)$ tests.

In fact, the corresponding lower bounds can be extended to
the scenario with noisy measurements as well.

Theorem 2: [Folklore] Any group-testing algorithm that has
measurements that are noisy i.i.d. with probability $q$ and that
has a probability of error of at most $\epsilon$ requires at least
$\frac{(1 - \epsilon)\, d \log(n/d)}{1 - H(q)}$ tests.⁷

Note: Our assumption that $d = o(n)$ implies that the bounds
in Theorems 1 and 2 are $\Omega(d \log n)$.
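
To get a feel for the magnitudes involved, the two lower bounds can be evaluated numerically. The following Python sketch is illustrative only: the function names are hypothetical and the parameter values in the usage line are arbitrary.

```python
import math

def binary_entropy(q):
    """Binary entropy H(q) in bits, with H(0) = H(1) = 0."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def lower_bound_tests(n, d, eps, q=0.0):
    """Lower bound (1 - eps) * d * log2(n/d) / (1 - H(q)) of Theorem 2.

    With q = 0 this reduces to the noiseless bound of Theorem 1.
    """
    if not 0.0 <= q < 0.5:
        raise ValueError("this sketch assumes 0 <= q < 1/2")
    return (1 - eps) * d * math.log2(n / d) / (1 - binary_entropy(q))

noiseless = lower_bound_tests(n=1024, d=4, eps=0.0)       # 4 * log2(256)
noisy = lower_bound_tests(n=1024, d=4, eps=0.0, q=0.1)
```

As expected, the noisy bound exceeds the noiseless one by the factor $1/(1 - H(q))$.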

B. Upper Bounds 

The main contributions of this work are explicit compu- 
tations of the number of tests required to give a desired 
probability of error via computationally efficient algorithms. 
In both the noiseless and noisy case, we consider two types 
of algorithms (CBP and COMP). Both these algorithms have
been considered before in the literature (for instance, see [8]),
but to the best of our knowledge ours is the first work to
explicitly compute the tradeoff between the number of tests
required and the desired probability of error, rather than
giving order-of-magnitude estimates of the number of tests
required for a "reasonable" probability of success.

Theorem 3: CBP with error probability at most $n^{-\delta}$
requires no more than $2(1 + \delta)\, e\, d \ln n$ tests.

Theorem 4: COMP with error probability at most $n^{-\delta}$
requires no more than $e\, d\, (1 + \delta) \ln n$ tests.

Note that Theorems 3 and 4 are commensurate with the
corresponding lower bound in Theorem 1.

Translating these algorithms into the noisy measurement
case is non-trivial. One approach is to repeat the tests in CBP,
leading to NCBP and the following theorem.

Theorem 5: NCBP with probability of error at most $n^{-\delta}$
requires no more than $2e(1 + \delta) \cdot 2\left(\ln\ln n + \ln d + \delta \ln n + 1 +
\ln(2(1 + \delta))\right)(1 - 2q)^{-2}\, d \log n$ tests.

Note that this is asymptotically worse than the
corresponding lower bound in Theorem 2. We therefore provide the main
result of this paper.

Theorem 6: NCOMP with error probability at most $n^{-\delta}$
requires no more than $4.36\left(\sqrt{\delta} + \sqrt{1 + \delta}\right)^2 (1 - 2q)^{-2}\, d \log n$
tests.

Note that the corresponding upper bound differs from
the lower bound by a factor that is at most
$4.36\left(\sqrt{\delta} + \sqrt{1 + \delta}\right)^2 (1 - 2q)^{-2}$, which is a function only of $q$ and $\delta$.
For "small" $q$ this quantity is "small".
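
The upper bounds can be compared numerically with a short Python sketch. It assumes the expressions read as $2(1+\delta)\,e\,d\ln n$ for CBP (Theorem 3), $(1+\delta)\,e\,d\ln n$ for COMP (Theorem 4) and $4.36(\sqrt{\delta}+\sqrt{1+\delta})^2(1-2q)^{-2}\,d\log_2 n$ for NCOMP (Theorem 6); the function names and the sample parameters are hypothetical.

```python
import math

def cbp_tests(n, d, delta):
    """Assumed reading of Theorem 3: 2 (1 + delta) e d ln n."""
    return 2 * (1 + delta) * math.e * d * math.log(n)

def comp_tests(n, d, delta):
    """Assumed reading of Theorem 4: (1 + delta) e d ln n."""
    return (1 + delta) * math.e * d * math.log(n)

def ncomp_tests(n, d, delta, q):
    """Assumed reading of Theorem 6, with a binary logarithm of n."""
    gap = 4.36 * (math.sqrt(delta) + math.sqrt(1 + delta)) ** 2 / (1 - 2 * q) ** 2
    return gap * d * math.log2(n)

n, d, delta = 10**6, 10, 1.0
bounds = {
    "CBP": cbp_tests(n, d, delta),
    "COMP": comp_tests(n, d, delta),
    "NCOMP(q=0.1)": ncomp_tests(n, d, delta, 0.1),
}
```

Under these readings the COMP bound is exactly half the CBP bound, and the NCOMP bound grows as $q$ approaches $1/2$.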

IV. Proof of lower bounds 

We begin by noting that $X \to Y \to \hat{Y} \to \hat{X}$ (i.e.,
the input vector, noiseless result vector, noisy result vector,

7 Here $H(\cdot)$ denotes the binary entropy function.



and the estimate vector) forms a Markov chain. By standard
information-theoretic definitions we have

$$H(X) = H(X \mid \hat{X}) + I(X; \hat{X}).$$

Since $X$ is uniformly distributed over all length-$n$, $d$-sparse
data vectors, $H(X) = \log|\mathcal{X}| = \log\binom{n}{d}$. By Fano's inequality,
$H(X \mid \hat{X}) \leq 1 + \epsilon \log\binom{n}{d}$. Also, we have $I(X; \hat{X}) \leq I(Y; \hat{Y})$
by the data-processing inequality. Finally, note that

$$I(Y; \hat{Y}) = H(\hat{Y}) - H(\hat{Y} \mid Y) \leq \sum_{i=1}^{T} \left[ H(\hat{Y}_i) - H(\hat{Y}_i \mid Y_i) \right],$$

since the first term is maximized when the $\hat{Y}_i$ are
independent, and because the measurement noise is memoryless.
For the BSC($q$) noise we consider in this work, this summation
is at most $T(1 - H(q))$ by standard arguments.⁸
Combining the above inequalities, we obtain

$$(1 - \epsilon) \log\binom{n}{d} \leq 1 + T(1 - H(q)).$$

Also, by standard arguments via Stirling's approximation [18],
$\log\binom{n}{d}$ is at least $d \log(n/d)$. Substituting this, and ignoring the
additive constant 1, gives us the desired result

$$T \geq \frac{1 - \epsilon}{1 - H(q)}\, d \log\left(\frac{n}{d}\right).$$

□

V. Proofs of upper bounds 
A. Noiseless Group Testing 

Proof of Theorem 3:

The Coupon Collector's Problem (CCP) is a classical
problem that considers the following scenario. Suppose there are $n$
types of coupons, each of which is equiprobable. A collector
selects coupons (with replacement) until he has a coupon of
each type. What is the distribution of his stopping time?
It is well-known ([19]) that the expected stopping time is
$n \ln n + O(n)$. Reasonable bounds on the tail of the
distribution are also known - for instance, it is known that
the probability that the stopping time is more than $\chi n \ln n$ is
at most $n^{-\chi+1}$.
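
The coupon-collector quantities used here are easy to evaluate or simulate. The Python sketch below is illustrative: the function names are hypothetical, the exact expectation uses the standard identity $n\sum_{k=1}^{n} 1/k$, and the tail bound is the $n^{-\chi+1}$ expression quoted above.

```python
import random

def expected_stopping_time(n):
    """Exact expectation n * (1 + 1/2 + ... + 1/n) = n ln n + O(n)."""
    return n * sum(1.0 / k for k in range(1, n + 1))

def tail_bound(n, chi):
    """Bound n^(1 - chi) on P(stopping time > chi * n * ln n)."""
    return n ** (1 - chi)

def simulate_stopping_time(n, rng):
    """Draw uniformly random coupons until all n types have been seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

rng = random.Random(1)   # assumed seed, only for reproducibility
samples = [simulate_stopping_time(50, rng) for _ in range(200)]
empirical_mean = sum(samples) / len(samples)
```

Comparing `empirical_mean` with `expected_stopping_time(50)` gives a quick sanity check of the expectation used in the analysis.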

Analogously to the above, we view the group-testing
procedure of CBP as a Coupon Collector Problem. Consider the
following thought experiment. Suppose we consider any test
as a length-$g$ test-vector⁹ whose entries index the items being
tested in that test (in this view, repeated entries are allowed in
this vector). Due to the design of our group-testing procedure
in CBP, the probability that any item occurs in any location of
8 This technique also holds for more general types of discrete memoryless 
noise - for ease of presentation, in this work we focus on the simple case of 
the Binary Symmetric Channel. 

9 Note that this test-vector is different from the binary length-n vectors 
that specify tests in the group testing-matrix, though there is indeed a natural 
bijection between them. 



such a vector is uniform and independent. In fact this property
(uniformity and independence of the value of each entry of
each test) also holds across tests. Hence, the items in any
subsequence of $k$ tests may be viewed as the outcome of a
process of selecting a single chain of $gk$ coupons. This is still
true even if we restrict ourselves solely to the tests that have
a negative outcome. The goal of CBP may now be viewed
as the task of collecting all the negative items. This can be
summarized in the following equation:

$$T g \left(\frac{n-d}{n}\right)^{g} \geq (n - d) \ln(n - d). \tag{1}$$

Modifying (1) to obtain the corresponding tail bound on $T$
takes a bit more work. The right-hand side is modified to
$\chi (n - d) \ln(n - d)$ (the probability that not all types of
coupons have been collected after this many
total coupons have been collected is at most $(n - d)^{-\chi+1}$).
The left-hand side is multiplied by $(1 - \rho)$, where $\rho$ is a
design parameter to be specified, chosen via Chernoff's bound on the
probability that the actual number of items in the negative
tests is smaller than $(1 - \rho)$ times the expected number. By
Chernoff's bound this probability is at most
$\exp\left(-\rho^2 T \left(\frac{n-d}{n}\right)^{g}\right)$. Taking
the union bound over these two low-probability events gives
us that the probability that

$$(1 - \rho)\, T g \left(\frac{n-d}{n}\right)^{g} \geq \chi (n - d) \ln(n - d) \tag{2}$$

does not hold is at most

$$\exp\left(-\rho^2 T \left(\frac{n-d}{n}\right)^{g}\right) + (n - d)^{-\chi+1}. \tag{3}$$



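So, optimizing for $g$ in (1) and substituting $g^* =
1/\ln\left(\frac{n}{n-d}\right)$ into (2), and noting that $\left(\frac{n-d}{n}\right)^{g^*}$ equals $e^{-1}$,
we have

$$T \geq \frac{\chi}{1 - \rho} \cdot \frac{(n - d) \ln(n - d)}{g^* \left(\frac{n-d}{n}\right)^{g^*}} = \frac{\chi e}{1 - \rho}\, (n - d) \ln(n - d) \ln\left(\frac{n}{n-d}\right). \tag{4}$$

Using the inequality $\ln(1 + x) \geq x - x^2/2$ with $x$ as $d/(n - d)$
simplifies the RHS of (4) to

$$T \geq \frac{\chi e}{1 - \rho} \left(d - \frac{d^2}{2(n - d)}\right) \ln(n - d). \tag{5}$$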

Choosing $T$ to be greater than the bound in (5) can only
reduce the probability of error, hence choosing

$$T \geq \frac{\chi e\, d \ln(n - d)}{1 - \rho} \tag{6}$$

still implies a probability of error at most as large as in (3).



Choosing $\rho = \frac{1}{2}$ and substituting (6) into (3) implies that, for
large enough $d$, the probability of error $P_e$ satisfies

$$P_e \leq e^{-\frac{\chi}{2} d \ln(n - d)} + (n - d)^{-\chi+1} = (n - d)^{-\frac{\chi d}{2}} + (n - d)^{-\chi+1} \leq 2 (n - d)^{-\chi+1}. \tag{7}$$

Taking $2(n - d)^{-\chi+1} = n^{-\delta}$, we have $\chi = \delta \frac{\log n}{\log(n - d)} +
\frac{1}{\log(n - d)} + 1$. For large $n$, $\chi$ approaches $\delta + 1$.

Therefore, for the probability of error to be at most $n^{-\delta}$ with
sufficiently large $n$, the following number of tests suffices to
fulfil the terms of the theorem:

$$T \geq 2(1 + \delta)\, e\, d \ln n.$$

□



Proof of Theorem 4:

As noted in the discussion on COMP in Section II-B, the
error events for the algorithm correspond to false positives,
when a column of $M$ corresponding to a non-defective item
is "hidden" by other columns corresponding to defective items.
To calculate this probability, recall that each entry of $M$
equals one with probability $p$, i.i.d. Let $j$ index a column of
$M$ corresponding to a non-defective item, and let $j_1, \ldots, j_d$
index the columns of $M$ corresponding to defective items.
Then the probability that $m_{i,j}$ equals one, and at least one of
$m_{i,j_1}, \ldots, m_{i,j_d}$ also equals one, is $p(1 - (1 - p)^d)$; conversely, the
probability that $m_{i,j}$ equals one and none of these entries equals one
is $p(1 - p)^d$. Hence
the probability that the $j$th column is hidden by columns
corresponding to defective items in all $T$ tests is $\left(1 - p(1 - p)^d\right)^T$. Taking
the union bound over all $n - d$ non-defective items gives us
that the probability of false positives is bounded from above as

$$P_e = P_e^{+} \leq (n - d) \left(1 - p(1 - p)^d\right)^T. \tag{8}$$

By differentiation, optimizing (8) with respect to $p$ suggests
choosing $p$ as $1/d$. Substituting this value back into (8), and
setting $T$ as $\beta d \ln n$ gives us

$$P_e \leq (n - d) \left(1 - \frac{1}{de}\right)^{\beta d \ln n} \leq (n - d)\, e^{-\beta e^{-1} \ln n}. \tag{9}$$

Choosing $\beta = (1 + \delta) e$ thus ensures the required decay in
the probability of error. Hence choosing $T$ to be at least
$(1 + \delta)\, e\, d \ln n$ suffices to prove the theorem. □
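
The choice of $p$ can be checked numerically with a small Python sketch (function names are hypothetical). The per-test quantity $p(1-p)^d$ is maximized exactly at $p = 1/(d+1)$, which is close to the value $1/d$ used above when $d$ is large; the second function evaluates the union bound $(n-d)(1 - p(1-p)^d)^T$ of (8).

```python
def unhide_prob(p, d):
    """Probability that a test contains a given non-defective item and none
    of the d defective items: p * (1 - p)^d."""
    return p * (1 - p) ** d

def false_positive_bound(n, d, p, T):
    """Union bound (n - d) * (1 - p (1 - p)^d)^T on the false-positive
    probability of COMP."""
    return (n - d) * (1 - unhide_prob(p, d)) ** T

d = 10
grid = [k / 1000 for k in range(1, 1000)]
best_p = max(grid, key=lambda p: unhide_prob(p, d))   # close to 1/(d+1)
bound_200 = false_positive_bound(1000, d, 1 / d, 200)
bound_400 = false_positive_bound(1000, d, 1 / d, 400)
```

The bound decays geometrically in $T$, which is what drives the $\Theta(d \ln n)$ scaling of the number of tests.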

B. Noisy Group Testing 

We now consider the harder problem of group testing when 
the measurements are noisy. First, just as a benchmark, we 
consider using the noiseless CBP algorithm with each test 
repeated identically K times, where K is a parameter to be 
determined so as to ensure a probability of error that can be 
made to decay asymptotically in n. 



Proof of Theorem 5:



Since each test has a probability q of giving the wrong 
result, by the Chernoff bound the probability that more than 
the threshold number of tests give the incorrect result is at 
most $e^{-2K(\frac{1}{2}-q)^2}$. Hence by the union bound, repeating each 
of the T tests K times, the probability that the decoder makes 
an error is at most 

$$P_e \le T\,e^{-2K\left(\frac{1}{2}-q\right)^2}. \tag{10}$$



Substituting T as $\beta d \log n$ implies that for the probability 
in (10) to approach zero asymptotically in n, K must be at least 

$$K \ge \frac{2(\delta\ln n + \ln T)}{(1-2q)^2} \ge \frac{2\left(\ln\ln n + \ln d + \delta\ln n + 1 + \ln(2(1+\delta))\right)}{(1-2q)^2}. \tag{11}$$

□ 



Note: As noted earlier, the number of tests required by NCBP 
is larger than that in the corresponding lower-bound theorem by 
a factor that is larger than any constant. 
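
To make (10) and (11) concrete, here is a minimal Python sketch that computes the smallest integer K satisfying the first inequality in (11) and then evaluates the right-hand side of (10) for that K. The function names and the sample values of n, T, q and $\delta$ are assumptions chosen only for illustration.

```python
import math

def ncbp_repetitions(n, T, q, delta):
    """Smallest integer K with K >= 2 (delta ln n + ln T) / (1 - 2q)^2, as in (11)."""
    if not 0 <= q < 0.5:
        raise ValueError("q must lie in [0, 1/2)")
    return math.ceil(2 * (delta * math.log(n) + math.log(T)) / (1 - 2 * q) ** 2)

def ncbp_error_bound(T, K, q):
    """Right-hand side of (10): T * exp(-2 K (1/2 - q)^2)."""
    return T * math.exp(-2 * K * (0.5 - q) ** 2)

# Illustrative (assumed) values.
n, T, q, delta = 10_000, 500, 0.1, 1.0
K = ncbp_repetitions(n, T, q, delta)
print("K =", K)
print("bound (10):", ncbp_error_bound(T, K, q), "target n^-delta:", n ** (-delta))
```

Because K grows like $\ln n/(1-2q)^2$, the total number of tests KT exceeds T by a factor that is not bounded by a constant.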

Proof of Theorem 6 

Due to the presence of noise, both false positives and false 
negatives may occur in the noisy COMP algorithm; the 
overall probability of error is the sum of the probability of 
false positives and that of false negatives. We set $p = \alpha/d$ 
(where $\alpha$ is a code-design parameter to be determined later) 
and $T = \beta d \log n$. We first calculate the probability of false 
negatives by computing the probability that more than the 
expected number of ones get flipped to zero in the result vector 
in locations corresponding to ones in the column indexing the 
defective item. This can be computed as 

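\begin{align}
P_e^- &= \bigcup_{i=1}^{d} P(|\mathcal{T}_i| = t)\, P\big(|\mathcal{S}_i| < |\mathcal{T}_i|(1-q(1+\Delta))\big) \tag{12}\\
&\le d \sum_{t=0}^{T} \binom{T}{t} p^t (1-p)^{T-t} \sum_{r=t-t(1-q(1+\Delta))}^{t} \binom{t}{r} q^r (1-q)^{t-r} \tag{13}\\
&\le d \sum_{t=0}^{T} \binom{T}{t} p^t (1-p)^{T-t}\, e^{-2t(q\Delta)^2} \tag{14}\\
&= d\left(1-p+p\,e^{-2(q\Delta)^2}\right)^{T} \tag{15}\\
&= d\left(1-\frac{\alpha}{d}+\frac{\alpha}{d}\,e^{-2(q\Delta)^2}\right)^{\beta d\log n} \tag{16}\\
&\le d\,e^{-\alpha\beta\left(1-e^{-2(q\Delta)^2}\right)\log n} \tag{17}\\
&\le d\,e^{-\alpha\beta(1-e^{-2})(q\Delta)^2\log n}. \tag{18}
\end{align}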



Here, as in Section II-B, $\mathcal{T}_i$ denotes the locations of ones 
in the ith column of M. Inequality (13) follows from the 
union bound over the possible errors for each of the defective 
items, with the first summation accounting for the different 
possible sizes of $\mathcal{T}_i$, and the second summation accounting 
for the error events corresponding to the number of one-to-zero 
flips exceeding the threshold chosen by the algorithm. 
Inequality (14) follows from the Chernoff bound. Equality (15) 
comes from the binomial theorem. Equality (16) comes from 
substituting in the values of p and T. Inequality (17) follows 
from the leading terms of the Taylor series of the exponential 
function. Inequality (18) follows from an appropriate linear 
lower bound to the concave function $1 - e^{-x}$. 
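
The last three steps of the chain can be checked numerically. The sketch below evaluates the expressions in (15), (17) and (18) for one set of parameters and prints them in order; each should be no larger than the next. The function name and the parameter values are illustrative assumptions, and the logarithm is assumed to be base 2 (the ordering holds for any base). The linear lower bound used for (18) requires $(q\Delta)^2 \le 1$.

```python
import math

def false_negative_chain(n, d, q, Delta, alpha, beta):
    """Values of (15), (17) and (18) for p = alpha/d and T = beta * d * log2(n)."""
    log_n = math.log2(n)          # assumption: 'log' is taken to be base 2
    p = alpha / d
    T = beta * d * log_n          # treated as a real number in this sketch
    x = (q * Delta) ** 2          # must lie in [0, 1] for the step to (18)
    eq15 = d * (1 - p + p * math.exp(-2 * x)) ** T
    eq17 = d * math.exp(-alpha * beta * (1 - math.exp(-2 * x)) * log_n)
    eq18 = d * math.exp(-alpha * beta * (1 - math.exp(-2)) * x * log_n)
    return eq15, eq17, eq18

# Illustrative (assumed) parameters.
eq15, eq17, eq18 = false_negative_chain(n=1000, d=10, q=0.1, Delta=2.0, alpha=0.5, beta=50.0)
print(eq15, eq17, eq18)
```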

For the requirement that the probability of false negatives 
be at most $n^{-\delta}$ to be satisfied, $\beta^-$ (the 
bound on $\beta$ due to this restriction) must be at least 
$(\alpha(1-e^{-2})(q\Delta)^2)^{-1}((\ln d/\ln n)+\delta)\ln 2$. Since d = o(n) this 
converges to 

$$\beta^- \ge \frac{\delta\ln 2}{\alpha(1-e^{-2})(q\Delta)^2}. \tag{19}$$



We now focus on the probability of false positives. In the 
noiseless CBP algorithm, the only way a false positive could 
occur was if all the ones in a column are hidden by ones 
in columns corresponding to defective items. In the noisy 
CBP algorithm this still happens, but in addition noise could 
also lead to a similar masking effect. That is, even in the 
1 locations of a non-defective column not hidden by other 
defective columns, measurement noise flips enough zeroes 
to ones so that the decoding threshold is exceeded, and the 
decoder declares that particular item to be defective. See 
Figure H|a) for an example. 

Hence we define a new quantity $a$, which denotes the 
probability for any (i,j)th location in M that a 1 in that 
location is "hidden by other columns or by noise". It equals 

$$a = 1-\left[(1-q)(1-p)^d + q\left(1-(1-p)^d\right)\right].$$

To facilitate our analysis, as $(1+x/n)^n \le e^x$ for n > 0, we 
bound 

\begin{align}
a &= 1-q-(1-p)^d(1-2q) \nonumber\\
&= 1-q-\left(1-\frac{\alpha}{d}\right)^d(1-2q) \nonumber\\
&\ge (1-q)-e^{-\alpha}(1-2q). \tag{20}
\end{align}
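
A small Python sketch of the quantity $a$ may help: it evaluates the definition, confirms that it agrees with the simplified form $1-q-(1-p)^d(1-2q)$, and compares it with the right-hand side of (20). The function names and sample values are assumptions made for illustration only.

```python
import math

def hidden_probability(p, d, q):
    """a = 1 - [(1 - q)(1 - p)^d + q (1 - (1 - p)^d)]."""
    no_defective = (1 - p) ** d   # probability that no defective item is in the test
    return 1 - ((1 - q) * no_defective + q * (1 - no_defective))

def hidden_probability_bound(alpha, q):
    """Right-hand side of (20): (1 - q) - exp(-alpha) (1 - 2q)."""
    return (1 - q) - math.exp(-alpha) * (1 - 2 * q)

# Illustrative (assumed) values, with p = alpha / d as in the proof.
alpha, d, q = 0.5, 20, 0.1
a = hidden_probability(alpha / d, d, q)
simplified = 1 - q - (1 - alpha / d) ** d * (1 - 2 * q)
print(a, simplified, hidden_probability_bound(alpha, q))
```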

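The probability of false positives is then computed in a similar 
manner to that of false negatives as in (12)-(18). 

\begin{align}
P_e^+ &= \bigcup_{i=1}^{n-d} P(|\mathcal{T}_i| = t)\, P\big(|\mathcal{S}_i| \ge |\mathcal{T}_i|(1-q(1+\Delta))\big) \nonumber\\
&\le (n-d) \sum_{t=0}^{T} \binom{T}{t} p^t (1-p)^{T-t} \sum_{r=t(1-q(1+\Delta))}^{t} \binom{t}{r} a^r (1-a)^{t-r} \nonumber\\
&\le (n-d)\left(1-p+p\,e^{-2((1-q(1+\Delta))-a)^2}\right)^{T} \tag{21}\\
&\le (n-d)\left(1-p+p\,e^{-2(e^{-\alpha}(1-2q)-\Delta q)^2}\right)^{T} \tag{22}\\
&\le (n-d)\left(1-\frac{\alpha}{d}+\frac{\alpha}{d}\,e^{-2\left[e^{-\alpha}(1-2q)-\Delta q\right]^2}\right)^{\beta d\log n} \nonumber\\
&\le (n-d)\,e^{-\alpha\beta(1-e^{-2})(e^{-\alpha}(1-2q)-\Delta q)^2\log n}. \tag{23}
\end{align}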



Note that for the Chernoff bound to be applicable in (21), we need 
$1-q(1+\Delta) > a$. Equation (22) follows from substituting the 
bound derived on $a$ in (20) into (21). For the requirement that 
the probability of false positives be at most $n^{-\delta}$ to be satisfied, 
$\beta^+$ (the bound on $\beta$ due to this restriction) must be at 
least $(\alpha(1-e^{-2})(e^{-\alpha}(1-2q)-\Delta q)^2)^{-1}((\ln(n-d))/(\ln n)+\delta)\ln 2$. 
Since d = o(n) this converges to 

$$\beta^+ \ge \frac{(1+\delta)\ln 2}{\alpha(1-e^{-2})(e^{-\alpha}(1-2q)-\Delta q)^2}. \tag{24}$$



Note that $\beta$ must be at least as large as $\max\{\beta^-, \beta^+\}$ so 
that both (19) and (24) are satisfied. 
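
The two constraints can be evaluated side by side. The following Python sketch implements the right-hand sides of (19) and (24) and returns their maximum for a given threshold parameter $\Delta$. The function names and the sample values are assumptions for illustration.

```python
import math

C = 1 - math.exp(-2)   # the constant (1 - e^{-2}) appearing in (19) and (24)

def beta_minus(alpha, q, Delta, delta):
    """Right-hand side of (19), the false-negative constraint."""
    return delta * math.log(2) / (alpha * C * (q * Delta) ** 2)

def beta_plus(alpha, q, Delta, delta):
    """Right-hand side of (24), the false-positive constraint."""
    gap = math.exp(-alpha) * (1 - 2 * q) - Delta * q
    if gap <= 0:
        raise ValueError("Delta must be below exp(-alpha) * (1 - 2q) / q")
    return (1 + delta) * math.log(2) / (alpha * C * gap ** 2)

def beta_required(alpha, q, Delta, delta):
    """beta must satisfy both constraints, hence the maximum."""
    return max(beta_minus(alpha, q, Delta, delta), beta_plus(alpha, q, Delta, delta))

# Illustrative (assumed) values: alpha = 0.5, q = 0.1, delta = 1.
for Delta in (1.0, 2.0, 3.0, 4.0):
    print(Delta, beta_minus(0.5, 0.1, Delta, 1.0), beta_plus(0.5, 0.1, Delta, 1.0))
```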

When the threshold in the noisy COMP algorithm is high 
(i.e., $\Delta$ is small) then the probability of false negatives 
increases; conversely, the threshold being low ($\Delta$ being large) 
increases the probability of false positives. Algebraically, this 
is expressed as the condition that $\Delta > 0$ (else the probability of 
false negatives is significant), and conversely as the condition 
that $1-q(1+\Delta) > a$ (so that the Chernoff bound can 
be used in (21)); combined with (20) this implies that 
$\Delta < e^{-\alpha}(1-2q)/q$. For fixed $\alpha$, each of (19) and (24) as a 
function of $\Delta$ is a reciprocal of a parabola, with a pole at the 
corresponding extremal value of $\Delta$. Furthermore, $\beta^-$ is strictly 
decreasing and $\beta^+$ is strictly increasing in the region of valid 
$\Delta$ in $(0, e^{-\alpha}(1-2q)/q)$. Hence the corresponding curves on 
the right-hand sides of (19) and (24) intersect within the region 
of valid $\Delta$, and a good choice for $\beta$ is at the $\Delta$ where these 
two curves intersect. Let 

$$\gamma = \frac{\ln d+\delta\ln n}{\ln(n-d)+\delta\ln n}.$$



(Note that for large n, since d = o(n), $\gamma$ approaches $\delta/(1+\delta)$.) 
Then equating the RHS of (19) and (24) implies that the 
optimal $\Delta^*$ satisfies 

$$\frac{\ln 2}{\alpha(1-e^{-2})(e^{-\alpha}(1-2q)-\Delta q)^2} = \frac{\gamma\ln 2}{\alpha(1-e^{-2})\Delta^2 q^2}. \tag{26}$$

Simplifying (26) gives us that 

$$\Delta^* = \frac{e^{-\alpha}(1-2q)}{q\left(1+\gamma^{-1/2}\right)}. \tag{27}$$



Substituting (27) into (24) we see that the resulting function 
can be viewed as $e^{2\alpha}/\alpha$ times factors that are independent of 
$\alpha$. Optimizing this with respect to $\alpha$ indicates that the minimal 
value of $\beta$ occurs when $\alpha = 0.5$. 
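
The optimization over $\Delta$ and $\alpha$ can be reproduced numerically. The sketch below evaluates (24) at the $\Delta^*$ of (27), using the large-n limit $\gamma = \delta/(1+\delta)$, and searches a grid of $\alpha$ values for the minimizer; it also compares the result at $\alpha = 0.5$ with the closed form $2e(\sqrt{\delta}+\sqrt{1+\delta})^2\ln 2/((1-e^{-2})(1-2q)^2)$ of (28). The function names, the grid, and the sample values of q and $\delta$ are assumptions for illustration.

```python
import math

def delta_star(alpha, q, gamma):
    """Equation (27): the Delta at which the curves (19) and (24) intersect."""
    return math.exp(-alpha) * (1 - 2 * q) / (q * (1 + gamma ** -0.5))

def beta_at_intersection(alpha, q, delta):
    """(24) evaluated at Delta*, with gamma = delta / (1 + delta) (large-n limit)."""
    gamma = delta / (1 + delta)
    gap = math.exp(-alpha) * (1 - 2 * q) - delta_star(alpha, q, gamma) * q
    return (1 + delta) * math.log(2) / (alpha * (1 - math.exp(-2)) * gap ** 2)

# Illustrative (assumed) values.
q, delta = 0.1, 1.0
alphas = [k / 100 for k in range(5, 301)]
best_alpha = min(alphas, key=lambda a: beta_at_intersection(a, q, delta))

constant = 2 * math.e * math.log(2) / (1 - math.exp(-2))   # roughly 4.36
closed_form = constant * (math.sqrt(delta) + math.sqrt(1 + delta)) ** 2 / (1 - 2 * q) ** 2

print("best alpha on grid:", best_alpha)
print("beta at alpha = 0.5:", beta_at_intersection(0.5, q, delta), "closed form:", closed_form)
```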



Substituting these values of $\alpha$, $\gamma$ and $\Delta$ into (19) gives us 
the explicit bound 

$$\beta = \frac{2e\left(\sqrt{\delta}+\sqrt{1+\delta}\right)^2\ln 2}{(1-e^{-2})(1-2q)^2} \approx \frac{4.36\left(\sqrt{\delta}+\sqrt{1+\delta}\right)^2}{(1-2q)^2}. \tag{28}$$

□ 

[Figure 5: plot against the number of tests (T), with Experimental, Theoretical-lower and Theoretical-upper curves for q = 0, 0.1 and 0.2.] 

Fig. 5. The probability of success for Noisy-COMP as a function of the 
number of tests T, for different values of the noise parameter q. 



VI. Simulation 

A. Noisy Random Incidence Algorithm 

We performed extensive simulations to validate our theoretical 
analysis. In the interest of space we present only Figure 5, 
which examines the probability of error of Noisy-COMP as 
a function of the number of tests. Note that the experimental 
values we obtain correlate well with the corresponding bounds. 
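
For readers who wish to run a similar experiment, the following is a minimal Python sketch of one possible simulation; it is not the code used to produce Figure 5. It assumes the decoding rule suggested by the thresholds in (12) and (21): item i is declared defective when $|\mathcal{S}_i| \ge |\mathcal{T}_i|(1-q(1+\Delta))$, where $|\mathcal{T}_i|$ is the number of tests containing item i and $|\mathcal{S}_i|$ is taken to be the number of those tests with a positive (noisy) outcome. All function names and parameter values are hypothetical.

```python
import random

def noisy_comp_trial(n, d, T, p, q, Delta, rng):
    """One random-pooling trial; returns (true defective set, decoded set)."""
    defective = set(rng.sample(range(n), d))
    in_tests = [0] * n    # |T_i|: number of tests containing item i
    positive = [0] * n    # |S_i|: number of those tests with a positive outcome
    for _ in range(T):
        pool = [j for j in range(n) if rng.random() < p]
        outcome = any(j in defective for j in pool)
        if rng.random() < q:          # each outcome is flipped independently w.p. q
            outcome = not outcome
        for j in pool:
            in_tests[j] += 1
            positive[j] += outcome
    threshold = 1 - q * (1 + Delta)
    decoded = {j for j in range(n) if positive[j] >= in_tests[j] * threshold}
    return defective, decoded

def success_rate(n, d, T, p, q, Delta, trials, seed=0):
    """Fraction of trials in which the decoded set equals the defective set."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        truth, decoded = noisy_comp_trial(n, d, T, p, q, Delta, rng)
        hits += (truth == decoded)
    return hits / trials

# Illustrative (assumed) parameters, with p = alpha / d and alpha = 0.5.
print(success_rate(n=100, d=3, T=400, p=0.5 / 3, q=0.1, Delta=2.0, trials=20))
```

Sweeping T for several values of q and plotting the resulting success rates would give curves of the same kind as those described in the caption of Fig. 5.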